Today, we will start creating beautiful visualizations with R.

Load the tidyverse package

First, let’s load the tidyverse package, using the command library(tidyverse):

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.5.2
## -- Attaching packages ------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.0       v purrr   0.3.0  
## v tibble  2.0.1       v dplyr   0.8.0.1
## v tidyr   0.8.2       v stringr 1.4.0  
## v readr   1.3.1       v forcats 0.4.0
## Warning: package 'ggplot2' was built under R version 3.5.2
## Warning: package 'tibble' was built under R version 3.5.2
## Warning: package 'tidyr' was built under R version 3.5.2
## Warning: package 'readr' was built under R version 3.5.3
## Warning: package 'purrr' was built under R version 3.5.2
## Warning: package 'dplyr' was built under R version 3.5.2
## Warning: package 'stringr' was built under R version 3.5.2
## Warning: package 'forcats' was built under R version 3.5.2
## -- Conflicts ---------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Data manipulation with dplyr

To manipulate data frames, we will use the dplyr package, which is installed and loaded as part of the tidyverse.

dplyr Grammar

Some of the key “verbs” provided by the dplyr package are:

  • select: return a subset of the columns of a data frame, using a flexible notation
  • filter: extract a subset of rows from a data frame based on logical conditions
  • arrange: reorder rows of a data frame
  • rename: rename variables in a data frame
  • mutate: add new variables/columns or transform existing variables
  • summarise / summarize: generate summary statistics of different variables in the data frame, possibly within strata
  • %>%: the “pipe” operator is used to connect multiple verb actions together into a pipeline
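As a quick preview of how these verbs combine, here is a minimal sketch using base R’s built-in airquality dataset (used here only for illustration; the chicago data appears below):

```r
library(dplyr)

# airquality ships with base R; Ozone, Temp, and Month are three of its columns
daily <- airquality %>%
  filter(!is.na(Ozone)) %>%               # keep rows with a recorded ozone value
  mutate(temp_c = (Temp - 32) / 1.8) %>%  # add a Celsius temperature column
  group_by(Month) %>%                     # stratify by month
  summarize(mean_ozone = mean(Ozone))     # one summary row per month
daily
```

Each verb takes a data frame as its first argument and returns a data frame, which is what makes them easy to chain together with %>%.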

To present the next examples, we will be using a dataset containing air pollution and temperature data for the city of Chicago in the U.S.

chicago <- readRDS("chicago.rds")
dim(chicago)
## [1] 6940    8
str(chicago)
## 'data.frame':    6940 obs. of  8 variables:
##  $ city      : chr  "chic" "chic" "chic" "chic" ...
##  $ tmpd      : num  31.5 33 33 29 32 40 34.5 29 26.5 32.5 ...
##  $ dptp      : num  31.5 29.9 27.4 28.6 28.9 ...
##  $ date      : Date, format: "1987-01-01" "1987-01-02" ...
##  $ pm25tmean2: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ pm10tmean2: num  34 NA 34.2 47 NA ...
##  $ o3tmean2  : num  4.25 3.3 3.33 4.38 4.75 ...
##  $ no2tmean2 : num  20 23.2 23.8 30.4 30.3 ...

select()

The select() function can be used to select columns of a data frame.

Suppose we want to keep only the first three columns. There are a few ways to do this: we could use numerical indices, for example, but we can also use the names directly.

names(chicago)[1:3]
## [1] "city" "tmpd" "dptp"
subset <- select(chicago, city:dptp)
head(subset)
##   city tmpd   dptp
## 1 chic 31.5 31.500
## 2 chic 33.0 29.875
## 3 chic 33.0 27.375
## 4 chic 29.0 28.625
## 5 chic 32.0 28.875
## 6 chic 40.0 35.125
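The numerical-index form mentioned above also works. A quick sketch with base R’s built-in airquality data frame (any data frame would do):

```r
library(dplyr)

# Select the first three columns by position rather than by name
first_three <- select(airquality, 1:3)
names(first_three)
## [1] "Ozone"   "Solar.R" "Wind"
```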

You can also omit variables using the select() function by using the negative sign.

subset_omit <- select(chicago, -(city:dptp))
head(subset_omit)
##         date pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
## 1 1987-01-01         NA   34.00000 4.250000  19.98810
## 2 1987-01-02         NA         NA 3.304348  23.19099
## 3 1987-01-03         NA   34.16667 3.333333  23.81548
## 4 1987-01-04         NA   47.00000 4.375000  30.43452
## 5 1987-01-05         NA         NA 4.750000  30.33333
## 6 1987-01-06         NA   48.00000 5.833333  25.77233

The select() function also supports special syntax for selecting variables whose names match a pattern.

To keep every variable that ends with a “2”, we could do

subset_vars_end2 <- select(chicago, ends_with("2"))
str(subset_vars_end2)
## 'data.frame':    6940 obs. of  4 variables:
##  $ pm25tmean2: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ pm10tmean2: num  34 NA 34.2 47 NA ...
##  $ o3tmean2  : num  4.25 3.3 3.33 4.38 4.75 ...
##  $ no2tmean2 : num  20 23.2 23.8 30.4 30.3 ...

To keep every variable that starts with a “d”, we could do

subset_vars_startd <- select(chicago, starts_with("d"))
str(subset_vars_startd)
## 'data.frame':    6940 obs. of  2 variables:
##  $ dptp: num  31.5 29.9 27.4 28.6 28.9 ...
##  $ date: Date, format: "1987-01-01" "1987-01-02" ...

filter()

The filter() function is used to extract subsets of rows from a data frame. This function is similar to the existing subset() function in R but is quite a bit faster.

Suppose we want to extract the rows of the chicago data frame where the level of PM2.5 is greater than 30 (a reasonably high level). We could do

chic.f <- filter(chicago, pm25tmean2 > 30)
str(chic.f)
## 'data.frame':    194 obs. of  8 variables:
##  $ city      : chr  "chic" "chic" "chic" "chic" ...
##  $ tmpd      : num  23 28 55 59 57 57 75 61 73 78 ...
##  $ dptp      : num  21.9 25.8 51.3 53.7 52 56 65.8 59 60.3 67.1 ...
##  $ date      : Date, format: "1998-01-17" "1998-01-23" ...
##  $ pm25tmean2: num  38.1 34 39.4 35.4 33.3 ...
##  $ pm10tmean2: num  32.5 38.7 34 28.5 35 ...
##  $ o3tmean2  : num  3.18 1.75 10.79 14.3 20.66 ...
##  $ no2tmean2 : num  25.3 29.4 25.3 31.4 26.8 ...
summary(chic.f$pm25tmean2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   30.05   32.12   35.04   36.63   39.53   61.50

We can place an arbitrarily complex logical sequence inside of filter(), so we could for example extract the rows where PM2.5 is greater than 30 and temperature is greater than 80 degrees Fahrenheit.

chic.f <- filter(chicago, pm25tmean2 > 30 & tmpd > 80)
select(chic.f, date, tmpd, pm25tmean2)
##          date tmpd pm25tmean2
## 1  1998-08-23   81   39.60000
## 2  1998-09-06   81   31.50000
## 3  2001-07-20   82   32.30000
## 4  2001-08-01   84   43.70000
## 5  2001-08-08   85   38.83750
## 6  2001-08-09   84   38.20000
## 7  2002-06-20   82   33.00000
## 8  2002-06-23   82   42.50000
## 9  2002-07-08   81   33.10000
## 10 2002-07-18   82   38.85000
## 11 2003-06-25   82   33.90000
## 12 2003-07-04   84   32.90000
## 13 2005-06-24   86   31.85714
## 14 2005-06-27   82   51.53750
## 15 2005-06-28   85   31.20000
## 16 2005-07-17   84   32.70000
## 17 2005-08-03   84   37.90000

arrange()

The arrange() function is used to reorder rows of a data frame according to one of the variables/columns.

We can order the rows of the data frame by date, so that the first row is the earliest (oldest) observation and the last row is the latest (most recent) observation.

head(chicago, 10)
##    city tmpd   dptp       date pm25tmean2 pm10tmean2  o3tmean2 no2tmean2
## 1  chic 31.5 31.500 1987-01-01         NA   34.00000  4.250000  19.98810
## 2  chic 33.0 29.875 1987-01-02         NA         NA  3.304348  23.19099
## 3  chic 33.0 27.375 1987-01-03         NA   34.16667  3.333333  23.81548
## 4  chic 29.0 28.625 1987-01-04         NA   47.00000  4.375000  30.43452
## 5  chic 32.0 28.875 1987-01-05         NA         NA  4.750000  30.33333
## 6  chic 40.0 35.125 1987-01-06         NA   48.00000  5.833333  25.77233
## 7  chic 34.5 26.750 1987-01-07         NA   41.00000  9.291667  20.58171
## 8  chic 29.0 22.000 1987-01-08         NA   36.00000 11.291667  17.03723
## 9  chic 26.5 29.000 1987-01-09         NA   33.28571  4.500000  23.38889
## 10 chic 32.5 27.750 1987-01-10         NA         NA  4.958333  19.54167
chicago <- arrange(chicago, date)

We can now check the first few rows

head(select(chicago, date, pm25tmean2), 3)
##         date pm25tmean2
## 1 1987-01-01         NA
## 2 1987-01-02         NA
## 3 1987-01-03         NA

and the last few rows.

tail(select(chicago, date, pm25tmean2), 3)
##            date pm25tmean2
## 6938 2005-12-29    7.45000
## 6939 2005-12-30   15.05714
## 6940 2005-12-31   15.00000

Rows can also be arranged in descending order of a column by using the special desc() operator.

chicago <- arrange(chicago, desc(date))

Looking at the first three and last three rows shows the dates in descending order.

head(select(chicago, date, pm25tmean2), 3)
##         date pm25tmean2
## 1 2005-12-31   15.00000
## 2 2005-12-30   15.05714
## 3 2005-12-29    7.45000
tail(select(chicago, date, pm25tmean2), 3)
##            date pm25tmean2
## 6938 1987-01-03         NA
## 6939 1987-01-02         NA
## 6940 1987-01-01         NA

rename()

The rename() function is designed to rename a variable in a data frame.

Here you can see the names of the first five variables in the chicago data frame.

head(chicago[, 1:5], 3)
##   city tmpd dptp       date pm25tmean2
## 1 chic   35 30.1 2005-12-31   15.00000
## 2 chic   36 31.0 2005-12-30   15.05714
## 3 chic   35 29.4 2005-12-29    7.45000

The dptp column is supposed to represent the dew point temperature and the pm25tmean2 column provides the PM2.5 data. However, these names are pretty obscure or awkward and should probably be renamed to something more sensible.

chicago <- rename(chicago, dewpoint = dptp, pm25 = pm25tmean2)
head(chicago[, 1:5], 3)
##   city tmpd dewpoint       date     pm25
## 1 chic   35     30.1 2005-12-31 15.00000
## 2 chic   36     31.0 2005-12-30 15.05714
## 3 chic   35     29.4 2005-12-29  7.45000

The syntax inside the rename() function is to have the new name on the left-hand side of the = sign and the old name on the right-hand side.

mutate()

The mutate() function exists to compute transformations of variables in a data frame.

For example, with air pollution data, we often want to detrend the data by subtracting the mean from the data. That way we can look at whether a given day’s air pollution level is higher than or less than average (as opposed to looking at its absolute level).

Here we create a pm25detrend variable that subtracts the mean from the pm25 variable.

chicago <- mutate(chicago, pm25detrend = pm25 - mean(pm25, na.rm = TRUE))
head(chicago)
##   city tmpd dewpoint       date     pm25 pm10tmean2  o3tmean2 no2tmean2
## 1 chic   35     30.1 2005-12-31 15.00000       23.5  2.531250  13.25000
## 2 chic   36     31.0 2005-12-30 15.05714       19.2  3.034420  22.80556
## 3 chic   35     29.4 2005-12-29  7.45000       23.5  6.794837  19.97222
## 4 chic   37     34.5 2005-12-28 17.75000       27.5  3.260417  19.28563
## 5 chic   40     33.6 2005-12-27 23.56000       27.0  4.468750  23.50000
## 6 chic   35     29.6 2005-12-26  8.40000        8.5 14.041667  16.81944
##   pm25detrend
## 1   -1.230958
## 2   -1.173815
## 3   -8.780958
## 4    1.519042
## 5    7.329042
## 6   -7.830958

group_by()

The group_by() function is used to generate summary statistics from the data frame within strata defined by a variable. For example, in this air pollution dataset, you might want to know what the average annual level of PM2.5 is. So the stratum is the year, and that is something we can derive from the date variable. In conjunction with the group_by() function we often use the summarize() function.

The general operation here is a combination of splitting a data frame into separate pieces defined by a variable or group of variables (group_by()), and then applying a summary function across those subsets (summarize()).

First, we can create a year variable using as.POSIXlt().

chicago <- mutate(chicago, year = as.POSIXlt(date)$year + 1900)
head(chicago)
##   city tmpd dewpoint       date     pm25 pm10tmean2  o3tmean2 no2tmean2
## 1 chic   35     30.1 2005-12-31 15.00000       23.5  2.531250  13.25000
## 2 chic   36     31.0 2005-12-30 15.05714       19.2  3.034420  22.80556
## 3 chic   35     29.4 2005-12-29  7.45000       23.5  6.794837  19.97222
## 4 chic   37     34.5 2005-12-28 17.75000       27.5  3.260417  19.28563
## 5 chic   40     33.6 2005-12-27 23.56000       27.0  4.468750  23.50000
## 6 chic   35     29.6 2005-12-26  8.40000        8.5 14.041667  16.81944
##   pm25detrend year
## 1   -1.230958 2005
## 2   -1.173815 2005
## 3   -8.780958 2005
## 4    1.519042 2005
## 5    7.329042 2005
## 6   -7.830958 2005

Now we can create a separate data frame that splits the original data frame by year.

years <- group_by(chicago, year)

Finally, we compute summary statistics for each year in the data frame with the summarize() function.

summarize(years, pm25 = mean(pm25, na.rm = TRUE), 
           o3 = max(o3tmean2, na.rm = TRUE), 
           no2 = median(no2tmean2, na.rm = TRUE))
## # A tibble: 19 x 4
##     year  pm25    o3   no2
##    <dbl> <dbl> <dbl> <dbl>
##  1  1987 NaN    63.0  23.5
##  2  1988 NaN    61.7  24.5
##  3  1989 NaN    59.7  26.1
##  4  1990 NaN    52.2  22.6
##  5  1991 NaN    63.1  21.4
##  6  1992 NaN    50.8  24.8
##  7  1993 NaN    44.3  25.8
##  8  1994 NaN    52.2  28.5
##  9  1995 NaN    66.6  27.3
## 10  1996 NaN    58.4  26.4
## 11  1997 NaN    56.5  25.5
## 12  1998  18.3  50.7  24.6
## 13  1999  18.5  57.5  24.7
## 14  2000  16.9  55.8  23.5
## 15  2001  16.9  51.8  25.1
## 16  2002  15.3  54.9  22.7
## 17  2003  15.2  56.2  24.6
## 18  2004  14.6  44.5  23.4
## 19  2005  16.2  58.8  22.6

In a slightly more complicated example, we might want to know what the average levels of ozone (o3) and nitrogen dioxide (no2) are within quintiles of pm25. A slicker way to do this would be through a regression model, but we can actually do it quickly with group_by() and summarize().

First, we can create a categorical variable of pm25 divided into quintiles.

qq <- quantile(chicago$pm25, seq(0, 1, 0.2), na.rm = TRUE)
chicago <- mutate(chicago, pm25.quint = cut(pm25, qq))
head(chicago)
##   city tmpd dewpoint       date     pm25 pm10tmean2  o3tmean2 no2tmean2
## 1 chic   35     30.1 2005-12-31 15.00000       23.5  2.531250  13.25000
## 2 chic   36     31.0 2005-12-30 15.05714       19.2  3.034420  22.80556
## 3 chic   35     29.4 2005-12-29  7.45000       23.5  6.794837  19.97222
## 4 chic   37     34.5 2005-12-28 17.75000       27.5  3.260417  19.28563
## 5 chic   40     33.6 2005-12-27 23.56000       27.0  4.468750  23.50000
## 6 chic   35     29.6 2005-12-26  8.40000        8.5 14.041667  16.81944
##   pm25detrend year  pm25.quint
## 1   -1.230958 2005 (12.4,16.7]
## 2   -1.173815 2005 (12.4,16.7]
## 3   -8.780958 2005   (1.7,8.7]
## 4    1.519042 2005 (16.7,22.6]
## 5    7.329042 2005 (22.6,61.5]
## 6   -7.830958 2005   (1.7,8.7]

Now we can group the data frame by the pm25.quint variable.

quint <- group_by(chicago, pm25.quint)
## Warning: Factor `pm25.quint` contains implicit NA, consider using
## `forcats::fct_explicit_na`
head(quint)
## # A tibble: 6 x 11
## # Groups:   pm25.quint [4]
##   city   tmpd dewpoint date        pm25 pm10tmean2 o3tmean2 no2tmean2
##   <chr> <dbl>    <dbl> <date>     <dbl>      <dbl>    <dbl>     <dbl>
## 1 chic     35     30.1 2005-12-31 15          23.5     2.53      13.2
## 2 chic     36     31   2005-12-30 15.1        19.2     3.03      22.8
## 3 chic     35     29.4 2005-12-29  7.45       23.5     6.79      20.0
## 4 chic     37     34.5 2005-12-28 17.8        27.5     3.26      19.3
## 5 chic     40     33.6 2005-12-27 23.6        27       4.47      23.5
## 6 chic     35     29.6 2005-12-26  8.4         8.5    14.0       16.8
## # ... with 3 more variables: pm25detrend <dbl>, year <dbl>,
## #   pm25.quint <fct>

Finally, we can compute the mean of o3 and no2 within quintiles of pm25.

summarize(quint, o3 = mean(o3tmean2, na.rm = TRUE), 
           no2 = mean(no2tmean2, na.rm = TRUE))
## # A tibble: 6 x 3
##   pm25.quint     o3   no2
##   <fct>       <dbl> <dbl>
## 1 (1.7,8.7]    21.7  18.0
## 2 (8.7,12.4]   20.4  22.1
## 3 (12.4,16.7]  20.7  24.4
## 4 (16.7,22.6]  19.9  27.3
## 5 (22.6,61.5]  20.3  29.6
## 6 <NA>         18.8  25.8

From the table, it seems there isn’t a strong relationship between pm25 and o3, but there appears to be a positive correlation between pm25 and no2. More sophisticated statistical modeling can help to provide precise answers to these questions, but a simple application of dplyr functions can often get you most of the way there.

%>%

The pipeline operator %>% is very handy for stringing together multiple dplyr functions into a sequence of operations. Notice above that every time we want to apply more than one function, the operations get buried in nested function calls that are difficult to read, i.e.

third(second(first(x)))

This nesting is not a natural way to think about a sequence of operations. The %>% operator allows you to string operations in a left-to-right fashion, i.e.

first(x) %>% second() %>% third()
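To make this concrete, here is a self-contained sketch using base R’s built-in mtcars data (rather than the chicago data): the nested and the piped forms produce exactly the same result.

```r
library(dplyr)

# Nested form: must be read inside-out
nested <- arrange(filter(mtcars, cyl == 4), mpg)

# Piped form: the same operations, read left to right
piped <- mtcars %>%
  filter(cyl == 4) %>%
  arrange(mpg)

identical(nested, piped)
## [1] TRUE
```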

Take the example that we just did in the last section where we computed the mean of o3 and no2 within quintiles of pm25. There we had to

  1. create a new variable pm25.quint
  2. split the data frame by that new variable
  3. compute the mean of o3 and no2 in the sub-groups defined by pm25.quint

That can be done with the following sequence in a single R expression.

mutate(chicago, pm25.quint = cut(pm25, qq)) %>%    
         group_by(pm25.quint) %>% 
         summarize(o3 = mean(o3tmean2, na.rm = TRUE), 
                   no2 = mean(no2tmean2, na.rm = TRUE))
## Warning: Factor `pm25.quint` contains implicit NA, consider using
## `forcats::fct_explicit_na`
## # A tibble: 6 x 3
##   pm25.quint     o3   no2
##   <fct>       <dbl> <dbl>
## 1 (1.7,8.7]    21.7  18.0
## 2 (8.7,12.4]   20.4  22.1
## 3 (12.4,16.7]  20.7  24.4
## 4 (16.7,22.6]  19.9  27.3
## 5 (22.6,61.5]  20.3  29.6
## 6 <NA>         18.8  25.8

This way we don’t have to create a set of temporary variables along the way or create a massive nested sequence of function calls.

Notice in the above code that we pass the chicago data frame to the first call to mutate(), but then afterwards we do not have to pass the first argument to group_by() or summarize(). Once you travel down the pipeline with %>%, the first argument is taken to be the output of the previous element in the pipeline.

Another example might be computing the average pollutant level by month. This could be useful to see if there are any seasonal trends in the data.

mutate(chicago, month = as.POSIXlt(date)$mon + 1) %>% 
         group_by(month) %>% 
         summarize(pm25 = mean(pm25, na.rm = TRUE), 
                   o3 = max(o3tmean2, na.rm = TRUE), 
                   no2 = median(no2tmean2, na.rm = TRUE))
## # A tibble: 12 x 4
##    month  pm25    o3   no2
##    <dbl> <dbl> <dbl> <dbl>
##  1     1  17.8  28.2  25.4
##  2     2  20.4  37.4  26.8
##  3     3  17.4  39.0  26.8
##  4     4  13.9  47.9  25.0
##  5     5  14.1  52.8  24.2
##  6     6  15.9  66.6  25.0
##  7     7  16.6  59.5  22.4
##  8     8  16.9  54.0  23.0
##  9     9  15.9  57.5  24.5
## 10    10  14.2  47.1  24.2
## 11    11  15.2  29.5  23.6
## 12    12  17.5  27.7  24.5

Here we can see that o3 tends to be low in the winter months and high in the summer while no2 is higher in the winter and lower in the summer.

Working with ggplot2

The ggplot2 package is also loaded as part of the tidyverse.

head(chicago)
##   city tmpd dewpoint       date     pm25 pm10tmean2  o3tmean2 no2tmean2
## 1 chic   35     30.1 2005-12-31 15.00000       23.5  2.531250  13.25000
## 2 chic   36     31.0 2005-12-30 15.05714       19.2  3.034420  22.80556
## 3 chic   35     29.4 2005-12-29  7.45000       23.5  6.794837  19.97222
## 4 chic   37     34.5 2005-12-28 17.75000       27.5  3.260417  19.28563
## 5 chic   40     33.6 2005-12-27 23.56000       27.0  4.468750  23.50000
## 6 chic   35     29.6 2005-12-26  8.40000        8.5 14.041667  16.81944
##   pm25detrend year  pm25.quint
## 1   -1.230958 2005 (12.4,16.7]
## 2   -1.173815 2005 (12.4,16.7]
## 3   -8.780958 2005   (1.7,8.7]
## 4    1.519042 2005 (16.7,22.6]
## 5    7.329042 2005 (22.6,61.5]
## 6   -7.830958 2005   (1.7,8.7]

Histogram

To summarize a continuous variable using a series of bars, we need to divide the observations into groups, or bins, and count how many fall in each one.

By default, the geom_histogram() function will choose a bin size for us based on a rule of thumb.

p <- ggplot(data = chicago,
            mapping = aes(x = pm25))
p + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 4447 rows containing non-finite values (stat_bin).

p <- ggplot(data = chicago,
            mapping = aes(x = pm25))
p + geom_histogram(bins = 10)
## Warning: Removed 4447 rows containing non-finite values (stat_bin).

Change line color and fill color

p <- ggplot(data = chicago, 
            mapping = aes(x=pm25)) +
  geom_histogram(color="darkblue", fill="lightblue")
p
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 4447 rows containing non-finite values (stat_bin).

Change line type

p <- ggplot(data = chicago, 
            mapping = aes(x=pm25)) +
  geom_histogram(color="black", fill="lightblue",
                 linetype="dashed")
p
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 4447 rows containing non-finite values (stat_bin).

p <- ggplot(data = chicago, 
            mapping = aes(x=pm25)) +
  geom_histogram(color="black", fill="pink",
                 linetype="dashed") +
  labs(title="PM2.5 histogram plot",x="PM2.5", y = "Count")+
  theme_classic()
p
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 4447 rows containing non-finite values (stat_bin).

It’s also possible to plot several histograms at once to compare distributions. We can facet histograms by some variable of interest, or, as here, compare them in the same plot using the fill mapping.

pm25.quint_ <- unique(na.omit(chicago$pm25.quint))
pm25.quint_
## [1] (12.4,16.7] (1.7,8.7]   (16.7,22.6] (22.6,61.5] (8.7,12.4] 
## Levels: (1.7,8.7] (8.7,12.4] (12.4,16.7] (16.7,22.6] (22.6,61.5]
p <- ggplot(data = subset(chicago, subset = pm25.quint %in% pm25.quint_),
            mapping = aes(x = dewpoint, fill = pm25.quint))
p + geom_histogram(alpha = 0.4, bins = 30)
## Warning: Removed 2 rows containing non-finite values (stat_bin).

When working with a continuous variable, an alternative to binning the data and making a histogram is to calculate a kernel density estimate of the underlying distribution. The geom_density() function will do this for us.

p <- ggplot(data = subset(chicago, subset = pm25.quint %in% pm25.quint_),
            mapping = aes(x = dewpoint, fill = pm25.quint, color = pm25.quint))
p + geom_density(alpha = 0.3)
## Warning: Removed 2 rows containing non-finite values (stat_density).

Yes, we did it again!!!

Congratulations ladies and thanks for coming today!

This presentation closely follows this source and this source